This is Info file ../../info/lispref.info, produced by Makeinfo version
1.68 from the input file lispref.texi.
Edition History:
GNU Emacs Lisp Reference Manual Second Edition (v2.01), May 1993
GNU Emacs Lisp Reference Manual Further Revised (v2.02), August 1993
Lucid Emacs Lisp Reference Manual (for 19.10) First Edition, March 1994
XEmacs Lisp Programmer's Manual (for 19.12) Second Edition, April 1995
GNU Emacs Lisp Reference Manual v2.4, June 1995
XEmacs Lisp Programmer's Manual (for 19.13) Third Edition, July 1995
XEmacs Lisp Reference Manual (for 19.14 and 20.0) v3.1, March 1996
XEmacs Lisp Reference Manual (for 19.15 and 20.1, 20.2) v3.2, April, May 1997
Copyright (C) 1990, 1991, 1992, 1993, 1994, 1995 Free Software
Foundation, Inc. Copyright (C) 1994, 1995 Sun Microsystems, Inc.
Copyright (C) 1995, 1996 Ben Wing.
Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.
Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided that the
entire resulting derived work is distributed under the terms of a
permission notice identical to this one.
Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that this permission notice may be stated in a
translation approved by the Foundation.
Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided also
that the section entitled "GNU General Public License" is included
exactly as in the original, and provided that the entire resulting
derived work is distributed under the terms of a permission notice
identical to this one.
Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that the section entitled "GNU General Public License"
may be included in a translation approved by the Free Software
Foundation instead of in the original English.
File: lispref.info, Node: Elisp Interface for Receiving Messages, Prev: Example of Receiving Messages, Up: Receiving Messages
Elisp Interface for Receiving Messages
--------------------------------------
- Function: make-tooltalk-pattern ATTRIBUTES
Create a ToolTalk pattern and initialize its attributes. The
value of attributes must be a list of alternating keyword/values,
where keywords are symbols that name valid pattern attributes or
lists of valid attributes. For example:
(make-tooltalk-pattern
  '(category TT_OBSERVE
    scope TT_SESSION
    op ("operation1" "operation2")
    args ("arg1" 12345 (TT_INOUT "arg3" "string"))))
Attribute names are the same as those supported by
`add-tooltalk-pattern-attribute', plus `args'.
Values must always be strings, integers, or symbols that represent
ToolTalk constants or lists of same. When a list of values is
provided all of the list elements are added to the attribute. In
the example above, messages whose `op' attribute is `"operation1"'
or `"operation2"' would match the pattern.
The value of ARGS should be a list of pattern arguments where each
pattern argument has the following form:
`(mode [value [type]])' or just `value'
Where MODE is one of `TT_IN', `TT_OUT', or `TT_INOUT' and TYPE is
a string. If TYPE isn't specified then `int' is used if VALUE is
a number; otherwise `string' is used. If TYPE is `string' then
VALUE is converted to a string (if it isn't a string already) with
`prin1-to-string'. If only a value is specified then MODE
defaults to `TT_IN'. If MODE is `TT_OUT' then VALUE and TYPE
don't need to be specified. You can find out more about the
semantics and uses of ToolTalk pattern arguments in chapter 3 of
the `ToolTalk Programmer's Guide'.
- Function: register-tooltalk-pattern PAT
XEmacs will begin receiving messages that match this pattern.
- Function: unregister-tooltalk-pattern PAT
XEmacs will stop receiving messages that match this pattern.
- Function: add-tooltalk-pattern-attribute VALUE PAT INDICATOR
Add one value to the indicated pattern attribute. The names of
attributes are the same as the ToolTalk accessors used to set them,
less the `tt_pattern_' prefix and the `_add' suffix. For
example, the attribute set by `tt_pattern_disposition_add' is named
`disposition'. The
`category' attribute is handled specially, since a pattern can only
be a member of one category (`TT_OBSERVE' or `TT_HANDLE').
Callbacks are handled slightly differently than in the C ToolTalk
API. The value of the `callback' attribute should be the name of a
function of one argument. It will be called each time the pattern
matches an incoming message.
- Function: add-tooltalk-pattern-arg PAT MODE TYPE VALUE
Add one fully-specified argument to a ToolTalk pattern. MODE must
be one of `TT_IN', `TT_INOUT', or `TT_OUT'. TYPE must be a
string. VALUE can be an integer, string or `nil'. If VALUE is an
integer then an integer argument (`tt_pattern_iarg_add') is added;
otherwise a string argument is added. At present there's no way
to add a binary data argument.
- Function: create-tooltalk-pattern
Create a new ToolTalk pattern and initialize its session attribute
to be the default session.
- Function: destroy-tooltalk-pattern PAT
Apply `tt_pattern_destroy' to the pattern. This effectively
unregisters the pattern.
- Function: describe-tooltalk-message MSG &optional STREAM
Print the message's attributes and arguments to STREAM. This is
often useful for debugging.
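Putting these functions together, here is a minimal sketch of
receiving messages. The operation name is invented, and the sketch
assumes that `callback' is accepted as a pattern attribute, per the
note on callbacks above:
(defun my-tooltalk-callback (msg)
  ;; Called once for each matching message; just dump it for debugging.
  (describe-tooltalk-message msg))

(defvar my-tooltalk-pattern
  (make-tooltalk-pattern
    '(category TT_OBSERVE
      scope TT_SESSION
      op "operation1"
      callback my-tooltalk-callback)))

(register-tooltalk-pattern my-tooltalk-pattern)
;; Later, to stop receiving matching messages:
;; (unregister-tooltalk-pattern my-tooltalk-pattern)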
File: lispref.info, Node: Internationalization, Next: MULE, Prev: ToolTalk Support, Up: Top
Internationalization
********************
* Menu:
* I18N Levels 1 and 2:: Support for different time, date, and currency formats.
* I18N Level 3:: Support for localized messages.
* I18N Level 4:: Support for Asian languages.
File: lispref.info, Node: I18N Levels 1 and 2, Next: I18N Level 3, Up: Internationalization
I18N Levels 1 and 2
===================
XEmacs is now compliant with I18N levels 1 and 2. Specifically,
this means that it is 8-bit clean and correctly handles time and date
functions. XEmacs will correctly display the entire ISO-Latin 1
character set.
The compose key may now be used to create any character in the
ISO-Latin 1 character set not directly available via the keyboard. In
order for the compose key to work, it is necessary to load the file
`x-compose.el'. At any time while composing a character, `C-h' will
display all valid completions and the character which would be produced.
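For example, a simple `load' from an init file suffices, going by the
description above:
(load "x-compose")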
File: lispref.info, Node: I18N Level 3, Next: I18N Level 4, Prev: I18N Levels 1 and 2, Up: Internationalization
I18N Level 3
============
* Menu:
* Level 3 Basics::
* Level 3 Primitives::
* Dynamic Messaging::
* Domain Specification::
* Documentation String Extraction::
File: lispref.info, Node: Level 3 Basics, Next: Level 3 Primitives, Up: I18N Level 3
Level 3 Basics
--------------
XEmacs now provides alpha-level functionality for I18N Level 3.
This means that everything necessary for full messaging is available,
but not every file has been converted.
The two message files which have been created are `src/emacs.po' and
`lisp/packages/mh-e.po'. Both files need to be converted using
`msgfmt', and the resulting `.mo' files placed in some locale's
`LC_MESSAGES' directory. The test "translations" in these files are
the original messages prefixed by `TRNSLT_'.
The domain for a variable is stored on the variable's property list
under the property name VARIABLE-DOMAIN. The function
`documentation-property' uses this information when translating a
variable's documentation.
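As a sketch of what this looks like (assuming the property name is the
literal symbol `variable-domain'; the variable is the gorilla example
used later in this chapter):
(defvar weight 250 "Weight of gorilla, in pounds." "emacs-gorilla")
(get 'weight 'variable-domain)
     => "emacs-gorilla"
(documentation-property 'weight 'variable-documentation)
     => the doc string, translated if a translation is installed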
File: lispref.info, Node: Level 3 Primitives, Next: Dynamic Messaging, Prev: Level 3 Basics, Up: I18N Level 3
Level 3 Primitives
------------------
- Function: gettext STRING
This function looks up STRING in the default message domain and
returns its translation. If `I18N3' was not enabled when XEmacs
was compiled, it just returns STRING.
- Function: dgettext DOMAIN STRING
This function looks up STRING in the specified message domain and
returns its translation. If `I18N3' was not enabled when XEmacs
was compiled, it just returns STRING.
- Function: bind-text-domain DOMAIN PATHNAME
This function associates a pathname with a message domain. Here's
how the path to the message file is constructed under SunOS 5.x:
`{pathname}/{LANG}/LC_MESSAGES/{domain}.mo'
If `I18N3' was not enabled when XEmacs was compiled, this function
does nothing.
- Special Form: domain STRING
This function specifies the text domain used for translating
documentation strings and interactive prompts of a function. For
example, write:
(defun foo (arg) "Doc string" (domain "emacs-foo") ...)
to specify `emacs-foo' as the text domain of the function `foo'.
The "call" to `domain' is actually a declaration rather than a
function; when actually called, `domain' just returns `nil'.
- Function: domain-of FUNCTION
This function returns the text domain of FUNCTION; it returns
`nil' if it is the default domain. If `I18N3' was not enabled
when XEmacs was compiled, it always returns `nil'.
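Here is a minimal sketch tying these primitives together; the
directory, locale, and domain name are hypothetical:
(bind-text-domain "emacs-gorilla" "/usr/local/share/locale")
;; Under SunOS 5.x with LANG set to "fr", the message file would be
;; looked for at /usr/local/share/locale/fr/LC_MESSAGES/emacs-gorilla.mo.
(dgettext "emacs-gorilla" "What gorilla?")
     => the translation from that catalog, or "What gorilla?" itself
        if no translation is found (or if `I18N3' was not compiled in)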
File: lispref.info, Node: Dynamic Messaging, Next: Domain Specification, Prev: Level 3 Primitives, Up: I18N Level 3
Dynamic Messaging
-----------------
The `format' function has been extended to permit you to change the
order of parameter insertion. For example, the conversion format
`%1$s' inserts parameter one as a string, while `%2$s' inserts
parameter two. This is useful when creating translations which require
you to change the word order.
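For example (the phrases are invented; this just illustrates the
numbered conversion specs described above):
(format "%s is %s" "Koko" "hungry")
     => "Koko is hungry"
(format "%2$s... %1$s!" "Koko" "hungry")
     => "hungry... Koko!"
A translator can thus reorder the arguments by editing only the format
string, without changing the code that supplies the parameters.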
File: lispref.info, Node: Domain Specification, Next: Documentation String Extraction, Prev: Dynamic Messaging, Up: I18N Level 3
Domain Specification
--------------------
The default message domain of XEmacs is `emacs'. For add-on
packages, it is best to use a different domain. For example, let us
say we want to convert the "gorilla" package to use the domain
`emacs-gorilla'. To translate the message "What gorilla?", use
`dgettext' as follows:
(dgettext "emacs-gorilla" "What gorilla?")
A function (or macro) which has a documentation string or an
interactive prompt needs to be associated with the domain in order for
the documentation or prompt to be translated. This is done with the
`domain' special form as follows:
(defun scratch (location)
  "Scratch the specified location."
  (domain "emacs-gorilla")
  (interactive "sScratch: ")
  ... )
It is most efficient to specify the domain in the first line of the
function body, before the `interactive' form.
For variables and constants which have documentation strings,
specify the domain after the documentation.
- Special Form: defvar SYMBOL [VALUE [DOC-STRING [DOMAIN]]]
Example:
(defvar weight 250 "Weight of gorilla, in pounds." "emacs-gorilla")
- Special Form: defconst SYMBOL [VALUE [DOC-STRING [DOMAIN]]]
Example:
(defconst limbs 4 "Number of limbs" "emacs-gorilla")
Autoloaded functions which are specified in `loaddefs.el' do not need
to have a domain specification, because their documentation strings are
extracted into the main message base. However, for autoloaded functions
which are specified in a separate package, use the following syntax:
- Function: autoload SYMBOL FILENAME &optional DOCSTRING INTERACTIVE
MACRO DOMAIN
Example:
(autoload 'explore "jungle" "Explore the jungle." nil nil "emacs-gorilla")
File: lispref.info, Node: Documentation String Extraction, Prev: Domain Specification, Up: I18N Level 3
Documentation String Extraction
-------------------------------
The utility `etc/make-po' scans the file `DOC' to extract
documentation strings and creates a message file `doc.po'. This file
may then be inserted within `emacs.po'.
Currently, `make-po' is hard-coded to read from `DOC' and write to
`doc.po'. In order to extract documentation strings from an add-on
package, first run `make-docfile' on the package to produce the `DOC'
file. Then run `make-po' with the `-p' argument to indicate that we
are extracting documentation for an add-on package.
(The `-p' argument is a kludge to make up for a subtle difference
between pre-loaded documentation and add-on documentation: For add-on
packages, the final carriage returns in the strings produced by
`make-docfile' must be ignored.)
File: lispref.info, Node: I18N Level 4, Prev: I18N Level 3, Up: Internationalization
I18N Level 4
============
The Asian-language support in XEmacs is called "MULE". *Note MULE::.
File: lispref.info, Node: MULE, Next: Tips, Prev: Internationalization, Up: Top
MULE
****
"MULE" is the name originally given to the version of GNU Emacs
extended for multi-lingual (and in particular Asian-language) support.
"MULE" is short for "MUlti-Lingual Emacs". It was originally called
Nemacs ("Nihon Emacs" where "Nihon" is the Japanese word for "Japan"),
when it only provided support for Japanese. XEmacs refers to its
multi-lingual support as "MULE support" since it is based on "MULE".
* Menu:
* Internationalization Terminology::
Definition of various internationalization terms.
* Charsets:: Sets of related characters.
* MULE Characters:: Working with characters in XEmacs/MULE.
* Composite Characters:: Making new characters by overstriking other ones.
* ISO 2022:: An international standard for charsets and encodings.
* Coding Systems:: Ways of representing a string of chars using integers.
* CCL:: A special language for writing fast converters.
* Category Tables:: Subdividing charsets into groups.
File: lispref.info, Node: Internationalization Terminology, Next: Charsets, Up: MULE
Internationalization Terminology
================================
In internationalization terminology, a string of text is divided up
into "characters", which are the printable units that make up the text.
A single character is (for example) a capital `A', the number `2', a
Katakana character, a Kanji ideograph (an "ideograph" is a "picture"
character, such as is used in Japanese Kanji, Chinese Hanzi, and Korean
Hanja; typically there are thousands of such ideographs in each
language), etc. The basic property of a character is its shape. Note
that the same character may be drawn by two different people (or in two
different fonts) in slightly different ways, although the basic shape
will be the same.
In some cases, the differences will be significant enough that it is
actually possible to identify two or more distinct shapes that both
represent the same character. For example, the lowercase letters `a'
and `g' each have two distinct possible shapes - the `a' can optionally
have a curved tail projecting off the top, and the `g' can be formed
either of two loops, or of one loop and a tail hanging off the bottom.
Such distinct possible shapes of a character are called "glyphs". The
important characteristic of two glyphs making up the same character is
that the choice between one or the other is purely stylistic and has no
linguistic effect on a word (this is the reason why a capital `A' and
lowercase `a' are different characters rather than different glyphs -
e.g. `Aspen' is a city while `aspen' is a kind of tree).
Note that "character" and "glyph" are used differently here than
elsewhere in XEmacs.
A "character set" is simply a set of related characters. ASCII, for
example, is a set of 94 characters (or 128, if you count non-printing
characters). Other character sets are ISO8859-1 (ASCII plus various
accented characters and other international symbols), JISX0201 (ASCII,
more or less, plus half-width Katakana), JISX0208 (Japanese Kanji),
JISX0212 (a second set of less-used Japanese Kanji), GB2312 (Mainland
Chinese Hanzi), etc.
Every character set has one or more "orderings", which can be viewed
as a way of assigning a number (or set of numbers) to each character in
the set. For most character sets, there is a standard ordering, and in
fact all of the character sets mentioned above define a particular
ordering. ASCII, for example, places letters in their "natural" order,
puts uppercase letters before lowercase letters, numbers before
letters, etc. Note that for many of the Asian character sets, there is
no natural ordering of the characters. The actual orderings are based
on one or more salient characteristics, of which there are many to
choose from - e.g. number of strokes, common radicals, phonetic
ordering, etc.
The numbers assigned to any particular character are called
the character's "position codes". The number of position codes
required to index a particular character in a character set is called
the "dimension" of the character set. ASCII, being a relatively small
character set, is of dimension one, and each character in the set is
indexed using a single position code, in the range 0 through 127 (if
non-printing characters are included) or 33 through 126 (if only the
printing characters are considered). JISX0208, i.e. Japanese Kanji,
has thousands of characters, and is of dimension two - every character
is indexed by two position codes, each in the range 33 through 126.
(Note that the choice of the range here is somewhat arbitrary.
Although a character set such as JISX0208 defines an *ordering* of all
its characters, it does not define the actual mapping between numbers
and characters. You could just as easily index the characters in
JISX0208 using numbers in the range 0 through 93, 1 through 94, 2
through 95, etc. The reason for the actual range chosen is so that the
position codes match up with the actual values used in the common
encodings.)
An "encoding" is a way of numerically representing characters from
one or more character sets into a stream of like-sized numerical values
called "words"; typically these are 8-bit, 16-bit, or 32-bit
quantities. If an encoding encompasses only one character set, then the
position codes for the characters in that character set could be used
directly. (This is the case with ASCII, and as a result, most people do
not understand the difference between a character set and an encoding.)
This is not possible, however, if more than one character set is to be
used in the encoding. For example, printed Japanese text typically
requires characters from multiple character sets - ASCII, JISX0208, and
JISX0212, to be specific. Each of these is indexed using one or more
position codes in the range 33 through 126, so the position codes could
not be used directly or there would be no way to tell which character
was meant. Different Japanese encodings handle this differently - JIS
uses special escape characters to denote different character sets; EUC
sets the high bit of the position codes for JISX0208 and JISX0212, and
puts a special extra byte before each JISX0212 character; etc. (JIS,
EUC, and most of the other encodings you will encounter are 7-bit or
8-bit encodings. There is one common 16-bit encoding, which is Unicode;
this strives to represent all the world's characters in a single large
character set. 32-bit encodings are generally used internally in
programs to simplify the code that manipulates them; however, they are
not much used externally because they are not very space-efficient.)
Encodings are classified as either "modal" or "non-modal". In a
"modal encoding", there are multiple states that the encoding can be in,
and the interpretation of the values in the stream depends on the
current global state of the encoding. Special values in the encoding,
called "escape sequences", are used to change the global state. JIS,
for example, is a modal encoding. The bytes `ESC $ B' indicate that,
from then on, bytes are to be interpreted as position codes for
JISX0208, rather than as ASCII. This effect is cancelled using the
bytes `ESC ( B', which mean "switch from whatever the current state is
to ASCII". To switch to JISX0212, the escape sequence `ESC $ ( D'.
(Note that here, as is common, the escape sequences do in fact begin
with `ESC'. This is not necessarily the case, however.)
A "non-modal encoding" has no global state that extends past the
character currently being interpreted. EUC, for example, is a
non-modal encoding. Characters in JISX0208 are encoded by setting the
high bit of the position codes, and characters in JISX0212 are encoded
by doing the same but also prefixing the character with the byte 0x8F.
The advantage of a modal encoding is that it is generally more
space-efficient, and is easily extendable because there are essentially
an arbitrary number of escape sequences that can be created. The
disadvantage, however, is that it is much more difficult to work with
if it is not being processed in a sequential manner. In the non-modal
EUC encoding, for example, the byte 0x41 always refers to the letter
`A'; whereas in JIS, it could either be the letter `A', or one of the
two position codes in a JISX0208 character, or one of the two position
codes in a JISX0212 character. Determining exactly which one is meant
could be difficult and time-consuming if the previous bytes in the
string have not already been processed.
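As a concrete illustration, the Hiragana letter A has JISX0208
position codes 0x24 0x22 (these particular values are standard, and
are shown only to make the difference visible). Text consisting of
just that character comes out as follows in the two encodings:
JIS (modal):      ESC $ B  0x24 0x22  ESC ( B
EUC (non-modal):  0xA4 0xA2
In JIS the state switches explicitly around the character; in EUC the
same position codes simply have their high bits set.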
Non-modal encodings are further divided into "fixed-width" and
"variable-width" formats. A fixed-width encoding always uses the same
number of words per character, whereas a variable-width encoding does
not. EUC is a good example of a variable-width encoding: one to three
bytes are used per character, depending on the character set. 16-bit
and 32-bit encodings are nearly always fixed-width, and this is in fact
one of the main reasons for using an encoding with a larger word size.
The advantages of fixed-width encodings should be obvious. The
advantages of variable-width encodings are that they are generally more
space-efficient and allow for compatibility with existing 8-bit
encodings such as ASCII.
Note that the bytes in an 8-bit encoding are often referred to as
"octets" rather than simply as bytes. This terminology dates back to
the days before 8-bit bytes were universal, when some computers had
9-bit bytes, others had 10-bit bytes, etc.
File: lispref.info, Node: Charsets, Next: MULE Characters, Prev: Internationalization Terminology, Up: MULE
Charsets
========
A "charset" in MULE is an object that encapsulates a particular
character set as well as an ordering of those characters. Charsets are
permanent objects and are named using symbols, like faces.
- Function: charsetp OBJECT
This function returns non-`nil' if OBJECT is a charset.
* Menu:
* Charset Properties:: Properties of a charset.
* Basic Charset Functions:: Functions for working with charsets.
* Charset Property Functions:: Functions for accessing charset properties.
* Predefined Charsets:: Predefined charset objects.
File: lispref.info, Node: Charset Properties, Next: Basic Charset Functions, Up: Charsets
Charset Properties
------------------
Charsets have the following properties:
`name'
A symbol naming the charset. Every charset must have a different
name; this allows a charset to be referred to using its name
rather than the actual charset object.
`doc-string'
A documentation string describing the charset.
`registry'
A regular expression matching the font registry field for this
character set. For example, both the `ascii' and `latin-iso8859-1'
charsets use the registry `"ISO8859-1"'. This field is used to
choose an appropriate font when the user gives a general font
specification such as `-*-courier-medium-r-*-140-*', i.e. a
14-point upright medium-weight Courier font.
`dimension'
Number of position codes used to index a character in the
character set. XEmacs/MULE can only handle character sets of
dimension 1 or 2. This property defaults to 1.
`chars'
Number of characters in each dimension. In XEmacs/MULE, the only
allowed values are 94 or 96. (There are a couple of pre-defined
character sets, such as ASCII, that do not follow this, but you
cannot define new ones like this.) Defaults to 94. Note that if
the dimension is 2, the character set thus described is 94x94 or
96x96.
`columns'
Number of columns used to display a character in this charset.
Only used in TTY mode. (Under X, the actual width of a character
can be derived from the font used to display the characters.) If
unspecified, defaults to the dimension. (This is almost always the
correct value, because character sets with dimension 2 are usually
ideograph character sets, which need two columns to display the
intricate ideographs.)
`direction'
A symbol, either `l2r' (left-to-right) or `r2l' (right-to-left).
Defaults to `l2r'. This specifies the direction that the text
should be displayed in, and will be left-to-right for most
charsets but right-to-left for Hebrew and Arabic. (Right-to-left
display is not currently implemented.)
`final'
Final byte of the standard ISO 2022 escape sequence designating
this charset. Must be supplied. Each combination of (DIMENSION,
CHARS) defines a separate namespace for final bytes, and each
charset within a particular namespace must have a different final
byte. Note that ISO 2022 restricts the final byte to the range
0x30 - 0x7E if dimension == 1, and 0x30 - 0x5F if dimension == 2.
Note also that final bytes in the range 0x30 - 0x3F are reserved
for user-defined (not official) character sets. For more
information on ISO 2022, see *Note Coding Systems::.
`graphic'
0 (use left half of font on output) or 1 (use right half of font on
output). Defaults to 0. This specifies how to convert the
position codes that index a character in a character set into an
index into the font used to display the character set. With
`graphic' set to 0, position codes 33 through 126 map to font
indices 33 through 126; with it set to 1, position codes 33
through 126 map to font indices 161 through 254 (i.e. the same
number but with the high bit set). For example, for a font whose
registry is ISO8859-1, the left half of the font (octets 0x20 -
0x7F) is the `ascii' charset, while the right half (octets 0xA0 -
0xFF) is the `latin-iso8859-1' charset.
`ccl-program'
A compiled CCL program used to convert a character in this charset
into an index into the font. This is in addition to the `graphic'
property. If a CCL program is defined, the position codes of a
character will first be processed according to `graphic' and then
passed through the CCL program, with the resulting values used to
index the font.
This is used, for example, in the Big5 character set (used in
Taiwan). This character set is not ISO-2022-compliant, and its
size (94x157) does not fit within the maximum 96x96 size of
ISO-2022-compliant character sets. As a result, XEmacs/MULE
splits it (in a rather complex fashion, so as to group the most
commonly used characters together) into two charset objects
(`big5-1' and `big5-2'), each of size 94x94, and each charset
object uses a CCL program to convert the modified position codes
back into standard Big5 indices to retrieve a character from a
Big5 font.
Most of the above properties can only be changed when the charset is
created. *Note Charset Property Functions::.
File: lispref.info, Node: Basic Charset Functions, Next: Charset Property Functions, Prev: Charset Properties, Up: Charsets
Basic Charset Functions
-----------------------
- Function: find-charset CHARSET-OR-NAME
This function retrieves the charset of the given name. If
CHARSET-OR-NAME is a charset object, it is simply returned.
Otherwise, CHARSET-OR-NAME should be a symbol. If there is no
such charset, `nil' is returned. Otherwise the associated charset
object is returned.
- Function: get-charset NAME
This function retrieves the charset of the given name. Same as
`find-charset' except an error is signalled if there is no such
charset instead of returning `nil'.
- Function: charset-list
This function returns a list of the names of all defined charsets.
- Function: make-charset NAME DOC-STRING PROPS
This function defines a new character set. This function is for
use with Mule support. NAME is a symbol, the name by which the
character set is normally referred to. DOC-STRING is a string
describing the character set. PROPS is a property list,
describing the specific nature of the character set. The
recognized properties are `registry', `dimension', `columns',
`chars', `final', `graphic', `direction', and `ccl-program', as
previously described.
- Function: make-reverse-direction-charset CHARSET NEW-NAME
This function makes a charset equivalent to CHARSET but which goes
in the opposite direction. NEW-NAME is the name of the new
charset. The new charset is returned.
- Function: charset-from-attributes DIMENSION CHARS FINAL &optional
DIRECTION
This function returns a charset with the given DIMENSION, CHARS,
FINAL, and DIRECTION. If DIRECTION is omitted, both directions
will be checked (left-to-right will be returned if character sets
exist for both directions).
- Function: charset-reverse-direction-charset CHARSET
This function returns the charset (if any) with the same dimension,
number of characters, and final byte as CHARSET, but which is
displayed in the opposite direction.
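As a sketch of how these functions fit together with the properties
from the previous node, here is a hypothetical charset definition. The
name, registry, and final byte are invented; a real final byte must be
unique within its (DIMENSION, CHARS) namespace, and 0x30 - 0x3F is the
user-defined range:
(make-charset
  'my-latin "A hypothetical 96-character Latin charset."
  '(registry "MyLatin-1"
    dimension 1
    chars 96
    final ?0          ; 0x30, the first user-defined final byte
    graphic 1
    direction l2r))

(find-charset 'my-latin)
     => the new charset object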
File: lispref.info, Node: Charset Property Functions, Next: Predefined Charsets, Prev: Basic Charset Functions, Up: Charsets
Charset Property Functions
--------------------------
All of these functions accept either a charset name or charset
object.
- Function: charset-property CHARSET PROP
This function returns property PROP of CHARSET. *Note Charset
Properties::.
Convenience functions are also provided for retrieving individual
properties of a charset.
- Function: charset-name CHARSET
This function returns the name of CHARSET. This will be a symbol.
- Function: charset-doc-string CHARSET
This function returns the doc string of CHARSET.
- Function: charset-registry CHARSET
This function returns the registry of CHARSET.
- Function: charset-dimension CHARSET
This function returns the dimension of CHARSET.
- Function: charset-chars CHARSET
This function returns the number of characters per dimension of
CHARSET.
- Function: charset-columns CHARSET
This function returns the number of display columns per character
(in TTY mode) of CHARSET.
- Function: charset-direction CHARSET
This function returns the display direction of CHARSET - either
`l2r' or `r2l'.
- Function: charset-final CHARSET
This function returns the final byte of the ISO 2022 escape
sequence designating CHARSET.
- Function: charset-graphic CHARSET
This function returns either 0 or 1, depending on whether the
position codes of characters in CHARSET map to the left or right
half of their font, respectively.
- Function: charset-ccl-program CHARSET
This function returns the CCL program, if any, for converting
position codes of characters in CHARSET into font indices.
The only property of a charset that can currently be set after the
charset has been created is the CCL program.
- Function: set-charset-ccl-program CHARSET CCL-PROGRAM
This function sets the `ccl-program' property of CHARSET to
CCL-PROGRAM.
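For example, using charsets from the predefined table in the next node
(the printed representation of the final byte may differ from what is
shown here):
(charset-dimension 'japanese-jisx0208)
     => 2
(charset-chars 'latin-iso8859-2)
     => 96
(charset-final 'ascii)
     => ?B
(charset-graphic 'latin-iso8859-1)
     => 1
(charset-direction 'arabic-iso8859-6)
     => r2l
(charset-registry 'ascii)
     => "ISO8859-1"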
File: lispref.info, Node: Predefined Charsets, Prev: Charset Property Functions, Up: Charsets
Predefined Charsets
-------------------
The following charsets are predefined in the C code.
Name                     Type   Fi  Gr  Dir  Registry
--------------------------------------------------------------
ascii                    94     B   0   l2r  ISO8859-1
control-1                94         0   l2r  ---
latin-iso8859-1          94     A   1   l2r  ISO8859-1
latin-iso8859-2          96     B   1   l2r  ISO8859-2
latin-iso8859-3          96     C   1   l2r  ISO8859-3
latin-iso8859-4          96     D   1   l2r  ISO8859-4
cyrillic-iso8859-5       96     L   1   l2r  ISO8859-5
arabic-iso8859-6         96     G   1   r2l  ISO8859-6
greek-iso8859-7          96     F   1   l2r  ISO8859-7
hebrew-iso8859-8         96     H   1   r2l  ISO8859-8
latin-iso8859-9          96     M   1   l2r  ISO8859-9
thai-tis620              96     T   1   l2r  TIS620
katakana-jisx0201        94     I   1   l2r  JISX0201.1976
latin-jisx0201           94     J   0   l2r  JISX0201.1976
japanese-jisx0208-1978   94x94  @   0   l2r  JISX0208.1978
japanese-jisx0208        94x94  B   0   l2r  JISX0208.19(83|90)
japanese-jisx0212        94x94  D   0   l2r  JISX0212
chinese-gb2312           94x94  A   0   l2r  GB2312
chinese-cns11643-1       94x94  G   0   l2r  CNS11643.1
chinese-cns11643-2       94x94  H   0   l2r  CNS11643.2
chinese-big5-1           94x94  0   0   l2r  Big5
chinese-big5-2           94x94  1   0   l2r  Big5
korean-ksc5601           94x94  C   0   l2r  KSC5601
composite                96x96      0   l2r  ---
The following charsets are predefined in the Lisp code.
Name                     Type   Fi  Gr  Dir  Registry
--------------------------------------------------------------
arabic-digit             94     2   0   l2r  MuleArabic-0
arabic-1-column          94     3   0   r2l  MuleArabic-1
arabic-2-column          94     4   0   r2l  MuleArabic-2
sisheng                  94     0   0   l2r  sisheng_cwnn\|OMRON_UDC_ZH
chinese-cns11643-3       94x94  I   0   l2r  CNS11643.1
chinese-cns11643-4       94x94  J   0   l2r  CNS11643.1
chinese-cns11643-5       94x94  K   0   l2r  CNS11643.1
chinese-cns11643-6       94x94  L   0   l2r  CNS11643.1
chinese-cns11643-7       94x94  M   0   l2r  CNS11643.1
ethiopic                 94x94  2   0   l2r  Ethio
ascii-r2l                94     B   0   r2l  ISO8859-1
ipa                      96     0   1   l2r  MuleIPA
vietnamese-lower         96     1   1   l2r  VISCII1.1
vietnamese-upper         96     2   1   l2r  VISCII1.1
For all of the above charsets, the dimension and number of columns
are the same.
Note that ASCII, Control-1, and Composite are handled specially.
This is why some of the fields are blank; and some of the filled-in
fields (e.g. the type) are not really accurate.
File: lispref.info, Node: MULE Characters, Next: Composite Characters, Prev: Charsets, Up: MULE
MULE Characters
===============
- Function: make-char CHARSET ARG1 &optional ARG2
This function makes a multi-byte character from CHARSET and octets
ARG1 and ARG2.
- Function: char-charset CH
This function returns the character set of char CH.
- Function: char-octet CH &optional N
This function returns the octet (i.e. position code) numbered N
(should be 0 or 1) of char CH. N defaults to 0 if omitted.
- Function: find-charset-region START END &optional BUFFER
This function returns a list of the charsets in the region between
START and END. BUFFER defaults to the current buffer if omitted.
- Function: find-charset-string STRING
This function returns a list of the charsets in STRING.
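A short sketch of these functions; the Latin-1 octet below is an
arbitrary value chosen only for illustration, so the exact character
it produces is not the point:
(char-charset ?A)
     => ascii
(char-octet ?A 0)
     => 65
(setq ch (make-char 'latin-iso8859-1 193))
(char-charset ch)
     => latin-iso8859-1
(find-charset-string "plain ASCII text")
     => (ascii)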
File: lispref.info, Node: Composite Characters, Next: ISO 2022, Prev: MULE Characters, Up: MULE
Composite Characters
====================
Composite characters are not yet completely implemented.
- Function: make-composite-char STRING
This function converts a string into a single composite character.
The character is the result of overstriking all the characters in
the string.
- Function: composite-char-string CH
This function returns a string of the characters comprising a
composite character.
- Function: compose-region START END &optional BUFFER
This function composes the characters in the region from START to
END in BUFFER into one composite character. The composite
character replaces the composed characters. BUFFER defaults to
the current buffer if omitted.
- Function: decompose-region START END &optional BUFFER
This function decomposes any composite characters in the region
from START to END in BUFFER. This converts each composite
character into one or more characters, the individual characters
out of which the composite character was formed. Non-composite
characters are left as-is. BUFFER defaults to the current buffer
if omitted.
File: lispref.info, Node: ISO 2022, Next: Coding Systems, Prev: Composite Characters, Up: MULE
ISO 2022
========
This section briefly describes the ISO 2022 encoding standard. For
more thorough understanding, please refer to the original document of
ISO 2022.
Character sets ("charsets") are classified into the following four
categories, according to the number of characters in the charset:
94-charset, 96-charset, 94x94-charset, and 96x96-charset.
94-charset
ASCII(B), left(J) and right(I) half of JISX0201, ...
96-charset
Latin-1(A), Latin-2(B), Latin-3(C), ...
94x94-charset
GB2312(A), JISX0208(B), KSC5601(C), ...
96x96-charset
none for the moment
The character in parentheses after the name of each charset is the
"final character" F, which can be regarded as the identifier of the
charset. ECMA allocates F to each charset. F is in the range of
0x30..0x7F, but 0x30..0x3F are only for private use.
Note: "ECMA" = European Computer Manufacturers Association
There are four "registers of charsets", called G0 thru G3. You can
designate (or assign) any charset to one of these registers.
The code space contained within one octet (of size 256) is divided
into 4 areas: C0, GL, C1, and GR. GL and GR are the areas into which a
charset register can be invoked.
C0: 0x00 - 0x1F
GL: 0x20 - 0x7F
C1: 0x80 - 0x9F
GR: 0xA0 - 0xFF
Usually, in the initial state, G0 is invoked into GL, and G1 is
invoked into GR.
ISO 2022 distinguishes 7-bit environments and 8-bit environments. In
7-bit environments, only C0 and GL are used.
Charset designation is done by escape sequences of the form:
ESC [I] I F
where I is an intermediate character in the range 0x20 - 0x2F, and F
is the final character identifying this charset.
The meanings of the intermediate characters are:
$ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
( [0x28]: designate to G0 a 94-charset whose final byte is F.
) [0x29]: designate to G1 a 94-charset whose final byte is F.
* [0x2A]: designate to G2 a 94-charset whose final byte is F.
+ [0x2B]: designate to G3 a 94-charset whose final byte is F.
- [0x2D]: designate to G1 a 96-charset whose final byte is F.
. [0x2E]: designate to G2 a 96-charset whose final byte is F.
/ [0x2F]: designate to G3 a 96-charset whose final byte is F.
The following rule is not allowed in ISO 2022 but can be used in
Mule.
, [0x2C]: designate to G0 a 96-charset whose final byte is F.
Here are examples of designations:
ESC ( B : designate to G0 ASCII
ESC - A : designate to G1 Latin-1
ESC $ ( A or ESC $ A : designate to G0 GB2312
ESC $ ( B or ESC $ B : designate to G0 JISX0208
ESC $ ) C : designate to G1 KSC5601
To use a charset designated to G2 or G3, and to use a charset
designated to G1 in a 7-bit environment, you must explicitly invoke G1,
G2, or G3 into GL. There are two types of invocation, Locking Shift
(forever) and Single Shift (one character only).
Locking Shift is done as follows:
LS0 or SI (0x0F): invoke G0 into GL
LS1 or SO (0x0E): invoke G1 into GL
LS2: invoke G2 into GL
LS3: invoke G3 into GL
LS1R: invoke G1 into GR
LS2R: invoke G2 into GR
LS3R: invoke G3 into GR
Single Shift is done as follows:
SS2 or ESC N: invoke G2 into GL
SS3 or ESC O: invoke G3 into GL
(#### Ben says: I think the above is slightly incorrect. It appears
that SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N
and ESC O behave as indicated. The above definitions will not parse
EUC-encoded text correctly, and it looks like the code in mule-coding.c
has similar problems.)
As you can see, there are many ISO-2022-compliant ways of encoding
multilingual text. Many such coding systems are in use, for example
X11's Compound Text, the Japanese JUNET code, and so-called EUC
(Extended UNIX Code); all of these are variants of ISO 2022.
In Mule, we characterize ISO 2022 by the following attributes:
1. Initial designation to G0 thru G3.
2. Allow designation of short form for Japanese and Chinese.
3. Should we designate ASCII to G0 before control characters?
4. Should we designate ASCII to G0 at the end of line?
5. 7-bit environment or 8-bit environment.
6. Use Locking Shift or not.
7. Use ASCII or JISX0201-1976-Roman.
8. Use JISX0208-1983 or JISX0208-1976.
(The last two are only for Japanese.)
By specifying these attributes, you can create any variant of ISO
2022.
Here are several examples:
junet -- Coding system used in JUNET.
1. G0 <- ASCII, G1..3 <- never used
2. Yes.
3. Yes.
4. Yes.
5. 7-bit environment
6. No.
7. Use ASCII
8. Use JISX0208-1983
ctext -- Compound Text
1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used
2. No.
3. No.
4. Yes.
5. 8-bit environment
6. No.
7. Use ASCII
8. Use JISX0208-1983
euc-china -- Chinese EUC. Although many people call this
"GB encoding", that name may cause misunderstanding.
1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used
2. No.
3. Yes.
4. Yes.
5. 8-bit environment
6. No.
7. Use ASCII
8. Use JISX0208-1983
korean-mail -- Coding system used in Korean network.
1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used
2. No.
3. Yes.
4. Yes.
5. 7-bit environment
6. Yes.
7. No.
8. No.
Mule creates all these coding systems by default.
File: lispref.info, Node: Coding Systems, Next: CCL, Prev: ISO 2022, Up: MULE
Coding Systems
==============
A coding system is an object that defines how text containing
multiple character sets is encoded into a stream of (typically 8-bit)
bytes. The coding system is used to decode the stream into a series of
characters (which may be from multiple charsets) when the text is read
from a file or process, and is used to encode the text back into the
same format when it is written out to a file or process.
For example, many ISO-2022-compliant coding systems (such as Compound
Text, which is used for inter-client data under the X Window System) use
escape sequences to switch between different charsets - Japanese Kanji,
for example, is invoked with `ESC $ ( B'; ASCII is invoked with `ESC (
B'; and Cyrillic is invoked with `ESC - L'. See `make-coding-system'
for more information.
Coding systems are normally identified using a symbol, and the
symbol is accepted in place of the actual coding system object whenever
a coding system is called for. (This is similar to how faces and
charsets work.)
- Function: coding-system-p OBJECT
This function returns non-`nil' if OBJECT is a coding system.
* Menu:
* Coding System Types:: Classifying coding systems.
* EOL Conversion:: Dealing with different ways of denoting
the end of a line.
* Coding System Properties:: Properties of a coding system.
* Basic Coding System Functions:: Working with coding systems.
* Coding System Property Functions:: Retrieving a coding system's properties.
* Encoding and Decoding Text:: Encoding and decoding text.
* Detection of Textual Encoding:: Determining how text is encoded.
* Big5 and Shift-JIS Functions:: Special functions for these non-standard
encodings.
File: lispref.info, Node: Coding System Types, Next: EOL Conversion, Up: Coding Systems
Coding System Types
-------------------
`nil'
`autodetect'
Automatic conversion. XEmacs attempts to detect the coding system
used in the file.
`no-conversion'
No conversion. Use this for binary files and such. On output,
graphic characters that are not in ASCII or Latin-1 will be
replaced by a `?'. (For a no-conversion-encoded buffer, these
characters will only be present if you explicitly insert them.)
`shift-jis'
Shift-JIS (a Japanese encoding commonly used in PC operating
systems).
`iso2022'
Any ISO-2022-compliant encoding. Among other things, this
includes JIS (the Japanese encoding commonly used for e-mail),
national variants of EUC (the standard Unix encoding for Japanese
and other languages), and Compound Text (an encoding used in X11).
You can specify more specific information about the conversion
with the FLAGS argument.
`big5'
Big5 (the encoding commonly used for Taiwanese).
`ccl'
The conversion is performed using a user-written pseudo-code
program. CCL (Code Conversion Language) is the name of this
pseudo-code.
`internal'
Write out or read in the raw contents of the memory representing
the buffer's text. This is primarily useful for debugging
purposes, and is only enabled when XEmacs has been compiled with
`DEBUG_XEMACS' set (the `--debug' configure option). *Warning*:
Reading in a file using `internal' conversion can result in an
internal inconsistency in the memory representing a buffer's text,
which will produce unpredictable results and may cause XEmacs to
crash. Under normal circumstances you should never use `internal'
conversion.
File: lispref.info, Node: EOL Conversion, Next: Coding System Properties, Prev: Coding System Types, Up: Coding Systems
EOL Conversion
--------------
`nil'
Automatically detect the end-of-line type (LF, CRLF, or CR). Also
generate subsidiary coding systems named `NAME-unix', `NAME-dos',
and `NAME-mac', that are identical to this coding system but have
an EOL-TYPE value of `lf', `crlf', and `cr', respectively.
`lf'
The end of a line is marked externally using ASCII LF. Since this
is also the way that XEmacs represents an end-of-line internally,
specifying this option results in no end-of-line conversion. This
is the standard format for Unix text files.
`crlf'
The end of a line is marked externally using ASCII CRLF. This is
the standard format for MS-DOS text files.
`cr'
The end of a line is marked externally using ASCII CR. This is the
standard format for Macintosh text files.
`t'
Automatically detect the end-of-line type but do not generate
subsidiary coding systems. (This value is converted to `nil' when
stored internally, and `coding-system-property' will return `nil'.)
File: lispref.info, Node: Coding System Properties, Next: Basic Coding System Functions, Prev: EOL Conversion, Up: Coding Systems
Coding System Properties
------------------------
`mnemonic'
String to be displayed in the modeline when this coding system is
active.
`eol-type'
End-of-line conversion to be used. It should be one of the types
listed in *Note EOL Conversion::.
`post-read-conversion'
Function called after a file has been read in, to perform the
decoding. Called with two arguments, BEG and END, denoting a
region of the current buffer to be decoded.
`pre-write-conversion'
Function called before a file is written out, to perform the
encoding. Called with two arguments, BEG and END, denoting a
region of the current buffer to be encoded.
The following additional properties are recognized if TYPE is
`iso2022':
`charset-g0'
`charset-g1'
`charset-g2'
`charset-g3'
The character set initially designated to the G0 - G3 registers.
The value should be one of
* A charset object (designate that character set)
* `nil' (do not ever use this register)
* `t' (no character set is initially designated to the
register, but may be later on; this automatically sets the
corresponding `force-g*-on-output' property)
`force-g0-on-output'
`force-g1-on-output'
`force-g2-on-output'
`force-g3-on-output'
If non-`nil', send an explicit designation sequence on output
before using the specified register.
`short'
If non-`nil', use the short forms `ESC $ @', `ESC $ A', and `ESC $
B' on output in place of the full designation sequences `ESC $ (
@', `ESC $ ( A', and `ESC $ ( B'.
`no-ascii-eol'
If non-`nil', don't designate ASCII to G0 at each end of line on
output. Setting this to non-`nil' also suppresses other
state-resetting that normally happens at the end of a line.
`no-ascii-cntl'
If non-`nil', don't designate ASCII to G0 before control chars on
output.
`seven'
If non-`nil', use 7-bit environment on output. Otherwise, use
8-bit environment.
`lock-shift'
If non-`nil', use locking-shift (SO/SI) instead of single-shift or
designation by escape sequence.
`no-iso6429'
If non-`nil', don't use ISO6429's direction specification.
`escape-quoted'
If non-`nil', literal control characters that are the same as the
beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in
particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3
(0x8F), and CSI (0x9B)) are "quoted" with an escape character so
that they can be properly distinguished from an escape sequence.
(Note that doing this results in a non-portable encoding.) This
encoding flag is used for byte-compiled files. Note that ESC is a
good choice for a quoting character because there are no escape
sequences whose second byte is a character from the Control-0 or
Control-1 character sets; this is explicitly disallowed by the ISO
2022 standard.
`input-charset-conversion'
A list of conversion specifications, specifying conversion of
characters in one charset to another when decoding is performed.
Each specification is a list of two elements: the source charset,
and the destination charset.
`output-charset-conversion'
A list of conversion specifications, specifying conversion of
characters in one charset to another when encoding is performed.
The form of each specification is the same as for
`input-charset-conversion'.
The following additional properties are recognized (and required) if
TYPE is `ccl':
`decode'
CCL program used for decoding (converting to internal format).
`encode'
CCL program used for encoding (converting to external format).
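As a sketch of how these properties are used, here is a hypothetical
7-bit ISO 2022 coding system. The name is invented, and the argument
order of `make-coding-system' (name, type, doc string, property list)
is assumed here rather than documented in this excerpt; charsets are
given by name, which is accepted in place of charset objects:
(make-coding-system
  'my-iso2022-7bit 'iso2022
  "Hypothetical 7-bit ISO 2022 coding system."
  '(charset-g0 ascii
    charset-g1 t          ; designated later, forces designation on output
    short t
    seven t
    mnemonic "ISO7"
    eol-type lf))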